Missing Data and Imputation 1

Concepts, Challenges, and Strategies to Address Missing Data

Erik Westlund

2025-11-20

Why Missing Data Matters

  • Maternal health studies often rely on observational cohorts, so incomplete data is common.
  • Ignoring missingness can drain statistical power, bias estimates, or make the truth unrecoverable.
  • Today: simulation-first walkthrough with simple DAGs and linear models.

Simulated Data: Seeing Is Believing

  • Simulated data lets us see firsthand how missing data strategies operate in practice.
  • Because we know the true data-generating process, we can see exactly when those strategies succeed or fail.

Causal Thinking

  • Working with observational data does not mean we can skip causality; it makes it more important.
  • A defensible causal graph is needed to reason about missingness mechanisms.
  • Without it, multiple imputation can reinforce bias instead of fixing it.

Stylized Teaching Examples

  • The DAGs and simulations today are deliberately simple.
  • Real maternal-health data has more variables, feedback loops, and measurement issues.

Stylized Teaching Examples (cont.)

  • Treat these as “toy” examples to see how missingness mechanisms and fixes behave.
  • Next session: revisit with a more realistic, higher-dimensional DAG.

No Exit & Three Other Plays: Mechanisms

  • Everything is Perfect: Complete case analysis
  • Ignorable and Annoying: Missing Completely at Random (MCAR)
  • Dangerous but Manageable: Missing at Random (MAR)
  • No Exit: Missing Not at Random (MNAR)

Today’s Plan

  • Work through examples of MCAR and MAR to get an intuition for their implications
  • Examine common “fixes” to show how they fail and motivate why we need multiple imputation

Ground Truth

  • Nutrition effect on birthweight (\(β_X\)): 0.5.
  • Observed factor effect on birthweight (\(β_E\)): 1.
  • Standard deviation around effect size (error): 2.
  • Keep these in mind as we go; they are the “truth” for all our examples.
beta_x_true    <- 0.5   # effect of nutrition on birthweight
beta_e_true    <- 1     # effect of the observed factor (e.g., education) on birthweight
sd_error_true  <- 2     # residual SD

Simulation Helper (for reference)

simulate_birth_data <- function(n = 1000, beta_x = 0.5, beta_e = 1, sd_error = 2,
                                mcar_rate = 0.5, mar_logit_shift = 1.2,
                                mar_depend = c("E", "X")) {
  mar_depend <- match.arg(mar_depend)
  E <- rbinom(n, 1, 0.5)                       # observed factor
  X <- 1.5 * E + rnorm(n)                      # nutrition depends on E
  Y <- 2 + beta_x * X + beta_e * E + rnorm(n, sd = sd_error)

  # MCAR: every row equally likely to lose X
  miss_mcar <- rbinom(n, 1, rep(mcar_rate, n))
  X_mcar <- ifelse(miss_mcar == 1, NA_real_, X)

  # MAR: missingness driven by E (or by X itself, the worst case)
  mar_var <- if (mar_depend == "E") 1 - E else -as.numeric(scale(X))

  # Solve for the intercept so the average MAR rate matches mcar_rate
  alpha <- if (mar_logit_shift == 0) {
    qlogis(mcar_rate)
  } else {
    uniroot(function(a) mean(plogis(a + mar_logit_shift * mar_var)) - mcar_rate,
            interval = c(-15, 15))$root
  }
  miss_mar <- rbinom(n, 1, plogis(alpha + mar_logit_shift * mar_var))
  X_mar <- ifelse(miss_mar == 1, NA_real_, X)

  tibble(E, X, Y, X_mcar, miss_mcar, X_mar, miss_mar)
}

Base Causal Structure: E → Nutrition → Birthweight ← E

  • Nutrition (X) → Birthweight (Y)
  • Another variable E affects both
  • No missingness yet; this is the target data-generating process.
  • Full data model uses Y ~ X + E as our ‘truth’.

Base Relationship: What’s in E?

  • E stands in for any observed factor that affects both nutrition and birthweight.
  • Examples: education, clinic context, or barriers to care.
  • We’re using a simple version to make the mechanics visible.

Causal Graph

Simulate Data & Fit the “Truth Model”

sim_data <- simulate_birth_data(
  n         = 5000,
  beta_x    = beta_x_true,
  beta_e    = beta_e_true,
  sd_error  = sd_error_true
)

full_fit <- lm(Y ~ X + E, data = sim_data)

Simulated Data Model Results


Call:
lm(formula = Y ~ X + E, data = sim_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6783 -1.3513 -0.0298  1.3986  7.7126 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.98314    0.04027   49.24   <2e-16 ***
X            0.50024    0.02878   17.38   <2e-16 ***
E            1.00465    0.07198   13.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.022 on 4997 degrees of freedom
Multiple R-squared:  0.2006,    Adjusted R-squared:  0.2003 
F-statistic: 627.1 on 2 and 4997 DF,  p-value: < 2.2e-16

Partial Relationship: Residualized X vs Y (conditioning on E)

MCAR: Missing Completely At Random

MCAR: The Happy Case

  • Missingness is truly random.
  • For MCAR, we randomly delete nutrition (X) for a subset of mothers.
  • Random missingness has no relationship to education or nutrition values.

MCAR: Impact

  • Missing data does not cause biased estimates. It just reduces precision/power.
  • Example: random equipment failures, accidentally dropped samples.

MCAR: DAG

Simulation Code: Base Data and MCAR

# Ensure base data exists (5000 rows defined earlier)
if (!exists("sim_data")) {
  sim_data <- simulate_birth_data(
    n         = 5000,
    beta_x    = beta_x_true,
    beta_e    = beta_e_true,
    sd_error  = sd_error_true
  )
}

# Full data model (gold standard) on the complete data
full_fit <- lm(Y ~ X + E, data = sim_data)

# Randomly mark 50% of X as missing (MCAR) for this demo
mcar_data <- sim_data |>
  mutate(
    drop_flag = rbinom(n(), 1, 0.5),
    X_mcar = if_else(drop_flag == 1, NA_real_, X)
  )

MCAR with Listwise Deletion

# Only keep rows where X_mcar is observed (listwise deletion)
mcar_cc <- mcar_data |> filter(!is.na(X_mcar))
mcar_fit <- lm(Y ~ X_mcar + E, data = mcar_cc)

nrow(mcar_cc)
[1] 2522
summary(mcar_fit)

Call:
lm(formula = Y ~ X_mcar + E, data = mcar_cc)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7022 -1.3534  0.0086  1.3901  7.7041 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.97448    0.05698   34.65   <2e-16 ***
X_mcar       0.47749    0.04102   11.64   <2e-16 ***
E            1.02528    0.10239   10.01   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.043 on 2519 degrees of freedom
Multiple R-squared:  0.1916,    Adjusted R-squared:  0.1909 
F-statistic: 298.5 on 2 and 2519 DF,  p-value: < 2.2e-16

Summary of Differences


Call:
lm(formula = Y ~ X_mcar + E, data = mcar_cc)

Coefficients:
(Intercept)       X_mcar            E  
     1.9745       0.4775       1.0253  
Method               Estimate       SE
Full data (truth)       0.500   0.0288
MCAR listwise           0.477   0.0410

Power Analysis: MCAR vs. Complete Data

MCAR: estimates center on truth; uncertainty widens as more X is missing.

MCAR: Not a Big Deal But Rarely Plausible

  • MCAR is the happy case: just drop mothers with missing nutrition, move on.
  • You lose precision/power, but no bias.
  • Problem: MCAR is seldom plausible, except in a few cases (e.g., random measurement instrument failure)

MAR: Missing At Random

MAR: Nonresponse Driven by E and X

  • Missingness rises with an observed factor (e.g., education, clinic context) and when nutrition is low.
  • Education affects both nutrition and birthweight.
  • Dropping cases with missing values up-weights higher-education/higher-nutrition mothers, biasing the nutrition effect.
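A minimal standalone sketch of this mechanism (the outcome coefficients match the ground-truth slide; the missingness-model intercept and slopes are illustrative choices, not the deck's exact values):

```r
# Illustrative MAR mechanism: p(X missing) rises with low education (1 - E)
# and with low nutrition (-X). Missingness coefficients are made up.
set.seed(2025)
n <- 20000
E <- rbinom(n, 1, 0.5)                       # observed factor (education)
X <- 1.5 * E + rnorm(n)                      # nutrition depends on E
Y <- 2 + 0.5 * X + 1 * E + rnorm(n, sd = 2)  # ground-truth outcome model
p_miss <- plogis(-2 + 1.5 * (1 - E) - 0.8 * X)
X_obs  <- ifelse(rbinom(n, 1, p_miss) == 1, NA_real_, X)
mean(is.na(X_obs))                           # overall missingness rate
```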

MAR Scenario Set

We will look at:

  • A single “truth” dataset (n = 20,000) with no missingness
  • MAR cases with 10%, 30%, and 60% of X missing
  • For each level: one scenario where missingness is driven by E (observable) and one where missingness is driven by X itself (worst case)

DAG

MAR Scenarios, summarized

  • Only X goes missing here; Y and E are fully observed, so their summaries stay unbiased.
  • Bias in the observed X grows as more data go missing.
  • Bias also grows as the link between X, E, and p(missing) gets stronger.
  • The highest-bias scenario has a particularly strong link between the true value of X and p(X missing).
Scenario                     N   Y mean   Y sd   Y bias   X mean   X sd   X bias   E mean   E sd   E bias
Full data                20000     2.87   2.24        0     0.73   1.25     0.00     0.49    0.5        0
10% missing, E-driven    17898     2.87   2.24        0     0.81   1.25     0.08     0.49    0.5        0
10% missing, X-driven    18018     2.87   2.24        0     0.96   1.09     0.23     0.49    0.5        0
30% missing, E-driven    14020     2.87   2.24        0     0.99   1.23     0.26     0.49    0.5        0
30% missing, X-driven    13971     2.87   2.24        0     1.33   0.91     0.60     0.49    0.5        0
60% missing, E-driven     8121     2.87   2.24        0     1.29   1.12     0.56     0.49    0.5        0
60% missing, X-driven     8000     2.87   2.24        0     1.90   0.74     1.17     0.49    0.5        0

MAR Scenarios: Naive Regression (Y ~ X)

  • No adjustment for E; drop rows with missing X.
  • True \(β_X\) = 0.50
  • E affects Y, X, and p(X missing); relative to this analysis model, the missingness is effectively MNAR and the model is mis-specified.
  • MNAR and mis-specified models tend to go hand-in-hand.
Scenario                     N     β_X   SE(β_X)
Full data                20000   0.734     0.012
10% missing, E-driven    17898   0.734     0.012
10% missing, X-driven    18018   0.745     0.014
30% missing, E-driven    14020   0.710     0.014
30% missing, X-driven    13971   0.744     0.019
60% missing, E-driven     8121   0.626     0.020
60% missing, X-driven     8000   0.717     0.031

MAR Scenarios: Adjusted Regression (Y ~ X + E)

  • This model is correctly specified. “No backdoor paths.”
  • Observed scenarios drop cases with missing X.
  • True \(β_X\) = 0.50
  • Even without any imputation, the adjusted model recovers \(β_X\) reasonably well in most scenarios; only heavy X-driven missingness (60%) shows visible bias.
  • For the E-driven scenarios, the main consequence is power loss, not bias.
Scenario                     N     β_X   SE(β_X)     β_E   SE(β_E)
Full data                20000   0.497     0.014   0.988     0.035
10% missing, E-driven    17898   0.496     0.015   1.001     0.037
10% missing, X-driven    18018   0.493     0.017   0.985     0.036
30% missing, E-driven    14020   0.500     0.017   0.945     0.044
30% missing, X-driven    13971   0.501     0.021   0.998     0.040
60% missing, E-driven     8121   0.489     0.022   1.015     0.073
60% missing, X-driven     8000   0.560     0.032   0.886     0.059

When MAR Holds (and When It Doesn’t)

  • If we model the variables that drive missingness (here, X and E), listwise deletion analyses stay near the truth.
  • In that case, missing data mostly hurts precision: fewer cases translate into bigger SEs.
  • Trouble begins when key drivers are unmeasured or omitted (e.g., X-driven missingness with only Y ~ X): then MAR is violated relative to the analysis, and bias appears.
  • In real workflows we combine careful modeling with imputation; that combination performs best.

Common, Non-Ideal Imputation Tactics

  • We apply four historically common tactics to each MAR scenario (E-driven and X-driven at 10/30/60% missing) plus the full, no-missing baseline.
  • Every regression keeps \(E\) in the model (since we observe it); what changes is how we reconstruct \(X\).
  • Tables on the next slides show \(β_X\), standard errors, and how far each method drifts from the true 0.50.

Mean Imputation

  • Insert a single cohort-wide mean for every missing nutrition value.
  • Mechanics: compute the mean of observed \(X\), then plug that constant into each missing slot
  • All imputed mothers receive the same nutrition score, so their points line up on the same vertical slice of the scatter.
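A self-contained sketch of this tactic (the data-generating step mirrors the helper shown earlier; the missingness model is an illustrative MAR choice):

```r
# Mean imputation sketch: one MAR dataset, then a single cohort-wide fill.
set.seed(1)
n <- 5000
E <- rbinom(n, 1, 0.5)
X <- 1.5 * E + rnorm(n)
Y <- 2 + 0.5 * X + E + rnorm(n, sd = 2)
X_obs <- ifelse(rbinom(n, 1, plogis(-1.5 + 1.2 * (1 - E))) == 1, NA_real_, X)

X_imp <- ifelse(is.na(X_obs), mean(X_obs, na.rm = TRUE), X_obs)  # constant fill
b <- coef(lm(Y ~ X_imp + E))   # slope attenuated relative to the true 0.5
b
```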

Mean Imputation: Simulation Results

Scenario                 Missing   Listwise N   Listwise β_X   Listwise SE       N     β_X   SE(β_X)   β_X bias   X mean   X sd
Full data                     0%        20000          0.497         0.014   20000   0.497     0.014     -0.003     0.73   1.25
10% missing, E-driven        11%        17898          0.496         0.015   20000   0.470     0.015     -0.030     0.81   1.18
10% missing, X-driven        10%        18018          0.493         0.017   20000   0.418     0.016     -0.082     0.96   1.03
30% missing, E-driven        30%        14020          0.500         0.017   20000   0.410     0.016     -0.090     0.99   1.03
30% missing, X-driven        30%        13971          0.501         0.021   20000   0.373     0.020     -0.127     1.33   0.76
60% missing, E-driven        59%         8121          0.489         0.022   20000   0.407     0.021     -0.093     1.29   0.71
60% missing, X-driven        60%         8000          0.560         0.032   20000   0.421     0.032     -0.079     1.90   0.47

Mean Imputation vs. Listwise Deletion

Mean Imputation: Problems

  • Even with \(E\) in the regression, bias is worse with mean imputation than just using listwise deletion.
  • In this case, it shrinks the slope toward zero: the missing cases have worse nutrition on average, yet we replace their values with a mean computed from the observed cases, among whom highly educated, well-nourished mothers are over-represented.
  • Standard errors are too small because we flooded the data with identical, non-varying values.

Mean + Indicator

  • Start with mean imputation, then append a binary “missing” indicator to the regression.
  • Mechanics: fill missing \(X\) with the cohort mean, create miss_ind = 1 when \(X\) was imputed, and fit \(Y \sim X + E + \text{miss\_ind}\).
  • Intuition: mothers with missing nutrition get the same slope as everyone else but a separate intercept shift governed by the indicator.
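A self-contained sketch of this tactic (same illustrative MAR data-generating step as before):

```r
# Mean + indicator sketch: mean-fill X and add a missingness dummy.
set.seed(1)
n <- 5000
E <- rbinom(n, 1, 0.5)
X <- 1.5 * E + rnorm(n)
Y <- 2 + 0.5 * X + E + rnorm(n, sd = 2)
X_obs <- ifelse(rbinom(n, 1, plogis(-1.5 + 1.2 * (1 - E))) == 1, NA_real_, X)

miss_ind <- as.integer(is.na(X_obs))                      # 1 = X was imputed
X_imp <- ifelse(miss_ind == 1, mean(X_obs, na.rm = TRUE), X_obs)
b <- coef(lm(Y ~ X_imp + E + miss_ind))                   # indicator = intercept shift
b
```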

Mean + Indicator: Simulation Results

Scenario                 Missing   Listwise N   Listwise β_X   Listwise SE       N     β_X   SE(β_X)   β_X bias   X mean   X sd
Full data                     0%        20000          0.497         0.014   20000   0.497     0.014     -0.003     0.73   1.25
10% missing, E-driven        11%        17898          0.496         0.015   20000   0.492     0.015     -0.008     0.81   1.18
10% missing, X-driven        10%        18018          0.493         0.017   20000   0.490     0.016     -0.010     0.96   1.03
30% missing, E-driven        30%        14020          0.500         0.017   20000   0.459     0.016     -0.041     0.99   1.03
30% missing, X-driven        30%        13971          0.501         0.021   20000   0.488     0.020     -0.012     1.33   0.76
60% missing, E-driven        59%         8121          0.489         0.022   20000   0.409     0.021     -0.091     1.29   0.71
60% missing, X-driven        60%         8000          0.560         0.032   20000   0.495     0.031     -0.005     1.90   0.47

Mean + Indicator vs. Listwise Deletion

Mean + Indicator: Problems

  • In this case, adding an indicator reduced bias considerably compared to plain mean imputation.
  • But it treats missingness as a fixed effect, so it can exacerbate bias in other coefficients (e.g., \(\beta_E\)) that are correlated with missingness.

Hot-Deck Imputation: Mechanics

  • For each missing nutrition value, randomly draw a donor mother who reported \(X\) and copy her score.
  • Mechanics: sample with replacement from the observed \(X\)’s, replace the NAs, and fit \(Y \sim X + E\) on the filled-in dataset.
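A self-contained sketch of this tactic (same illustrative MAR data-generating step as before):

```r
# Hot-deck sketch: each missing X copies a randomly drawn observed donor.
set.seed(1)
n <- 5000
E <- rbinom(n, 1, 0.5)
X <- 1.5 * E + rnorm(n)
Y <- 2 + 0.5 * X + E + rnorm(n, sd = 2)
X_obs <- ifelse(rbinom(n, 1, plogis(-1.5 + 1.2 * (1 - E))) == 1, NA_real_, X)

donors <- X_obs[!is.na(X_obs)]                     # observed values only
X_imp  <- X_obs
X_imp[is.na(X_obs)] <- sample(donors, sum(is.na(X_obs)), replace = TRUE)
b <- coef(lm(Y ~ X_imp + E))
b
```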

Hot-Deck Imputation: Simulation Results

Scenario                 Missing   Listwise N   Listwise β_X   Listwise SE       N     β_X   SE(β_X)   β_X bias   X mean   X sd
Full data                     0%        20000          0.497         0.014   20000   0.497     0.014     -0.003     0.73   1.25
10% missing, E-driven        11%        17898          0.496         0.015   20000   0.392     0.014     -0.108     0.81   1.25
10% missing, X-driven        10%        18018          0.493         0.017   20000   0.344     0.015     -0.156     0.96   1.09
30% missing, E-driven        30%        14020          0.500         0.017   20000   0.263     0.013     -0.237     0.99   1.22
30% missing, X-driven        30%        13971          0.501         0.021   20000   0.257     0.017     -0.243     1.33   0.91
60% missing, E-driven        59%         8121          0.489         0.022   20000   0.177     0.013     -0.323     1.30   1.12
60% missing, X-driven        60%         8000          0.560         0.032   20000   0.141     0.020     -0.359     1.90   0.74

Hot-Deck Imputation vs. Listwise Deletion

Hot-Deck Imputation: Problems

  • In this case, hot-deck performs worse than mean imputation, greatly exacerbating the bias from listwise deletion.
  • It works when missingness is MCAR; once missingness depends on \(E\) or \(X\), the donor pool misrepresents the missing cases.
  • It still yields overly tight SEs because draws are confined to observed cases, which here come from a tighter distribution than the truth.

Regression Imputation: Mechanics (Single Driver)

  • Estimate a predictive model \(\widehat{X} = \widehat{\alpha} + \widehat{\gamma} E\) using only mothers with observed nutrition.
  • \(\widehat{\alpha}\) is the fitted intercept (baseline nutrition when \(E = 0\)); \(\widehat{\gamma}\) is the estimated effect of \(E\) on nutrition.
  • Mechanics: use the observed \(E\) to predict \(X\) for every missing case, replace the NA with that prediction, and then run the outcome regression.
  • In our simulation, low-education mothers receive imputations near \(\widehat{\alpha}\) while high-education mothers get \(\widehat{\alpha} + \widehat{\gamma}\)—mirroring the DAG structure instead of collapsing to a single mean.
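A self-contained sketch of this tactic (same illustrative MAR data-generating step as before):

```r
# Regression imputation sketch: predict each missing X from observed E.
set.seed(1)
n <- 5000
E <- rbinom(n, 1, 0.5)
X <- 1.5 * E + rnorm(n)
Y <- 2 + 0.5 * X + E + rnorm(n, sd = 2)
X_obs <- ifelse(rbinom(n, 1, plogis(-1.5 + 1.2 * (1 - E))) == 1, NA_real_, X)

imp_fit <- lm(X_obs ~ E)                  # lm() drops the NA rows automatically
X_imp   <- X_obs
X_imp[is.na(X_obs)] <- predict(imp_fit,
                               newdata = data.frame(E = E[is.na(X_obs)]))
b <- coef(lm(Y ~ X_imp + E))
b
```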

Regression Imputation: Mechanics (Multiple Drivers)

  • When several observed variables \(Z_1, Z_2, \ldots, Z_p\) relate to nutrition or missingness, fit \(\widehat{X} = \widehat{\alpha} + \sum_{j=1}^p \widehat{\gamma}_j Z_j\) among mothers who reported \(X\).
  • Each coefficient \(\widehat{\gamma}_j\) shows how that covariate shifts expected nutrition; the imputed value is the linear predictor for each missing case.
  • In richer data, the \(Z_j\) could include clinic context, SES, prior visits, etc.—anything observed that plausibly drives both nutrition and missingness.

Model-based Imputation Techniques

  • With rich data sets, you do not have to limit your missing data model to covariates in your analysis model
  • The idea is to model each variable with the most informative model you have
  • This is similar to propensity score models, where you model the treatment mechanism to produce matched sets or weights to achieve balance

Regression Imputation: Simulation Results

Scenario                 Missing   Listwise N   Listwise β_X   Listwise SE       N     β_X   SE(β_X)   β_X bias   X mean   X sd
Full data                     0%        20000          0.497         0.014   20000   0.497     0.014     -0.003     0.73   1.25
10% missing, E-driven        11%        17898          0.496         0.015   20000   0.496     0.015     -0.004     0.73   1.21
10% missing, X-driven        10%        18018          0.493         0.017   20000   0.493     0.017     -0.007     0.90   1.05
30% missing, E-driven        30%        14020          0.500         0.017   20000   0.500     0.017      0.000     0.73   1.13
30% missing, X-driven        30%        13971          0.501         0.021   20000   0.501     0.021      0.001     1.20   0.81
60% missing, E-driven        59%         8121          0.489         0.022   20000   0.489     0.023     -0.011     0.74   0.98
60% missing, X-driven        60%         8000          0.560         0.032   20000   0.560     0.033      0.060     1.72   0.53

Regression Imputation vs. Listwise Deletion

Regression Imputation: Problems

  • SEs are typically too optimistic: every missing \(X\) is set deterministically, with no residual noise.
  • Multiple imputation can add back the right amount of noise and uncertainty.
  • Remember: the goal is unbiased estimates of both \(\beta_X\) and \(SE(\beta_X)\)

Deterministic Imputation ≠ Power

  • Same line, more dots (Enders): Regression imputation replaces each missing \(X\) with its predicted value from \(E\), so the imputed points fall exactly on the fitted line. With correlation between the imputed \(X\) and \(E\) essentially one, those rows reinforce the complete-case relationship instead of adding new leverage.
  • Spread, not headcount, drives SEs: The standard error of a slope depends on the spread of the predictor, not the raw row count. Because all imputed \(X\)’s sit on that line, they add almost no variance, so the denominator of \(\mathrm{SE}(\hat\beta_X)\) barely changes. Worse, single-value imputation pretends those fills are certain, so standard errors can be too small (Gelman & Hill).
  • Where false confidence lives: The problem isn’t complete-case analysis—it’s single imputation that ignores the extra uncertainty. Complete cases lose power but report honest SEs; deterministically imputed cases keep \(N\) high but understate uncertainty and still inherit bias when missingness isn’t MCAR.
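The “spread, not headcount” point can be made precise. In the simple-regression case,

\[
\mathrm{SE}(\hat\beta_X) = \frac{\hat\sigma}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}},
\]

so imputed rows whose \(x_i\) sit exactly on the fitted line add little to the sum in the denominator while deflating \(\hat\sigma\), which is how deterministic single imputation manufactures false precision.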

Where Do We Go From Here?

  • We need to put believable spread back into the missing rows rather than printing the same predicted value over and over.
  • Multiple imputation helps mitigate this problem

Multiple Imputation

  • Treat the missing nutrition scores as random draws from a model that predicts \(X\) using the information we do have (here, \(E\)).
  • Do that filling process multiple times, fit \(Y \sim X + E\) in each completed data set, and then average the estimates using Rubin’s rules.
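A minimal mice-based sketch of that workflow (assuming the mice package is installed; method = "norm" is stochastic regression imputation, and the data-generating step mirrors the earlier helper):

```r
# Multiple imputation sketch with mice: m completed datasets, pooled by
# Rubin's rules. The missingness model below is an illustrative MAR choice.
library(mice)
set.seed(1)
n <- 5000
E <- rbinom(n, 1, 0.5)
X <- 1.5 * E + rnorm(n)
Y <- 2 + 0.5 * X + E + rnorm(n, sd = 2)
X[rbinom(n, 1, plogis(-1.5 + 1.2 * (1 - E))) == 1] <- NA  # MAR missingness

dat  <- data.frame(Y, X, E)
imp  <- mice(dat, m = 20, method = "norm", seed = 42, printFlag = FALSE)
fits <- with(imp, lm(Y ~ X + E))   # refit the analysis model in each dataset
summary(pool(fits))                # Rubin's rules: combined estimates and SEs
```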

Multiple Imputation: Simulation Results

Scenario                 Missing   Listwise N   Listwise β_X   Listwise SE   β_X (MI)   SE (MI)   β_X bias (MI)
Full data                     0%        20000          0.497         0.014      0.497     0.014          -0.003
10% missing, E-driven        11%        17898          0.496         0.015      0.502     0.015           0.002
10% missing, X-driven        10%        18018          0.493         0.017      0.494     0.017          -0.006
30% missing, E-driven        30%        14020          0.500         0.017      0.495     0.017          -0.005
30% missing, X-driven        30%        13971          0.501         0.021      0.502     0.023           0.002
60% missing, E-driven        59%         8121          0.489         0.022      0.462     0.027          -0.038
60% missing, X-driven        60%         8000          0.560         0.032      0.553     0.028           0.053

Multiple Imputation vs. Listwise Deletion

Why Multiple Imputation Helps

  • Adds plausible wiggle: MI draws \(X^* = f(E) + \varepsilon\), where \(f(E)\) is the best guess for nutrition given \(E\) and \(\varepsilon\) is a random residual sampled from the estimated spread. That restores the variation we would have seen if \(X\) were observed.
  • Honest uncertainty: Rubin’s imputation combination rules blend the within- and between-imputation variance, so SEs stay truthful even when missingness is heavy.
  • Bias correction: Because each draw reflects how \(X\) and \(E\) relate, the averaged estimate lands near the true \(\beta_X\).
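Rubin’s rules, for reference, combine the \(m\) per-imputation estimates \(\hat{Q}_1, \ldots, \hat{Q}_m\) and their variances as

\[
\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j, \qquad
B = \frac{1}{m-1}\sum_{j=1}^{m}\bigl(\hat{Q}_j - \bar{Q}\bigr)^2, \qquad
T = \bar{W} + \Bigl(1 + \frac{1}{m}\Bigr)B,
\]

where \(\bar{W}\) is the average within-imputation variance. The reported standard error is \(\sqrt{T}\), so disagreement across imputations honestly inflates uncertainty.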

Multiple Imputation: Takeaways

  • MI stays close to the true \(\beta_X\) even in the harsh scenario where the values of X themselves drive missingness.
  • The combined SEs stay honest because Rubin’s rules keep the run-to-run variation in the mix.
  • Next session we’ll dig into more details on how MI works and how to implement it yourself.

MNAR: Missing Not At Random

MNAR & Imputation

  • Multiple imputation is not a magic bullet
  • If you do not have rich enough covariates to accurately model p(missing X), there is no missing data strategy that can save your analysis.
  • Let’s finish by examining such a scenario

MNAR DAG: Hidden Drivers

MNAR Stress Test Setup

  • Nutrition, income, provider quality, family stability, and class are all correlated.
  • We only observe education and insurance, so any missingness tied to the other pieces is effectively MNAR for us.
  • Let’s compare listwise deletion vs. MI when we have a complex causal graph with unobserved data

MNAR Simulation Details

  • Simulated 2,000 mothers with seven covariates; truth: \(\beta_X = 0.50\) in \(Y \sim X + \text{education} + \text{insurance}\).
  • Missing nutrition is more likely for low income, low provider quality, fragile family, lower class position mothers.
  • Analysts only get \(Y\), education, insurance, and nutrition (with missingness).

MNAR Table 1: Who’s in Each Sample?

  • With MI, our summary statistics look much closer to the truth.
  • The MAR illustration showed that listwise deletion recovered accurate model estimates.
  • However, an accurate Table 1 of summary statistics also benefits from imputation!
Sample      Mean X   SD X   Mean educ   Mean insurance   Mean Y   SD Y
Full data     0.00   1.00        0.00             0.00     3.03   2.10
Listwise      0.28   0.97        0.33             0.29     3.45   2.04
MI (avg)      0.13   0.98        0.00             0.00     3.03   2.10

MNAR Model Specs

  • Truth (omniscient): Y ~ X + education + insurance + income + provider + family + class.
  • What we can actually fit: \(Y \sim X + \text{education} + \text{insurance}\) on whatever \(X\) values we have.
  • We compare three versions of that observed model: full data (for reference), listwise deletion, and MI.

MNAR Results: Listwise vs. MI

  • Omniscient model (with hidden drivers) nails \(\beta_X\), as expected.
  • Even with full nutrition but no hidden drivers, we drift because the omitted variables matter.
  • Once nutrition goes missing, both listwise and MI lean on education/insurance; MI can actually look worse than listwise deletion!
Method                             β_X      SE     Bias
Omniscient (X + hidden drivers)  0.484   0.052   -0.016
Observed, no missing             0.563   0.050    0.063
Listwise deletion                0.564   0.069    0.064
Multiple imputation              0.585   0.067    0.085

MNAR: No Easy Fix

  • Both listwise deletion and MI miss the truth because neither sees the hidden drivers of missingness.
  • MI can even overshoot when missingness leans on factors we never measure—its imputed \(X\) just mirrors education/insurance again.
  • Only extra information (new variables, external data, or sensitivity analyses) can break out of this corner.

Conclusion

Takeaways

  • Handling missing data well is an exercise in good causal inference, for both Y and your missing Xs
  • No missing data strategy can fix an impoverished data set
  • Spend serious time on your causal graphs/DAGs before implementing imputation strategies
  • Don’t be afraid to simulate your specific data: you can’t observe your missing Xs, but you can simulate scenarios that show what would happen to your results under various conditions

Thank You!

  • Erik Westlund
  • Johns Hopkins Biostatistics Center
  • ewestlund@jhu.edu